Performance of Fault Tolerant Networks of Workstations
Abstract
Functional or dataflow models of computation enable a program’s run-time system to determine which portions of the program must be repeated when faults occur. The run-time system of Cilk, a threaded C, was modified to enable a Network of Workstations (NoW) to tolerate fail-stop faults of individual processors or of the network. The overheads needed to provide this fault tolerance are shown to be mainly CPU cycles and memory, with little additional network load generated in the absence of faults. This makes it feasible to run long computations successfully on NoWs where ownership, control and distribution of the individual processors may be widely distributed.
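The idea in the first sentence of the abstract can be illustrated with a toy simulation. The sketch below is not the paper's modified Cilk run-time; it is a minimal Python model, under the assumption that every task is a pure function of its inputs and that a fail-stop fault simply erases the results held by one worker. All names (`run_with_failure`, `graph`, `placement`, the worker labels) are hypothetical and chosen for illustration only.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical dataflow graph: task name -> (pure function, names of input tasks).
Graph = Dict[str, Tuple[Callable[..., int], Tuple[str, ...]]]


def run_with_failure(graph: Graph, order: List[str], placement: Dict[str, str],
                     failed_worker: str, fail_before: str,
                     spare: str) -> Tuple[int, List[str]]:
    """Run tasks in `order`; `failed_worker` fail-stops just before the task
    `fail_before` starts. Returns the last task's value and an execution log."""
    store: Dict[str, Dict[str, int]] = {}  # worker -> results held on that worker
    placement = dict(placement)            # local copy; tasks may be reassigned
    log: List[str] = []

    def execute(name: str) -> int:
        fn, deps = graph[name]
        value = fn(*[lookup(d) for d in deps])
        store.setdefault(placement[name], {})[name] = value
        log.append(name)
        return value

    def lookup(name: str) -> int:
        held = store.get(placement[name], {})
        if name in held:
            return held[name]
        # The result was lost with its worker. Because the task is pure,
        # repeating it (and any of its lost inputs) gives the same value.
        return execute(name)

    value = 0
    for name in order:
        if name == fail_before:
            store.pop(failed_worker, None)  # fail-stop: its results vanish
            for task, worker in placement.items():
                if worker == failed_worker:
                    placement[task] = spare  # reassign lost work to a spare
        value = execute(name)
    return value, log


# Tiny example graph: d = (a + b) * a, with a = 2 and b = 3.
graph: Graph = {
    "a": (lambda: 2, ()),
    "b": (lambda: 3, ()),
    "c": (lambda x, y: x + y, ("a", "b")),
    "d": (lambda x, y: x * y, ("c", "a")),
}
placement = {"a": "w1", "b": "w2", "c": "w1", "d": "w2"}
value, log = run_with_failure(graph, ["a", "b", "c", "d"], placement,
                              failed_worker="w1", fail_before="d", spare="w3")
print(value, log)
```

In this run, worker `w1` fails after computing `a` and `c`, so only those two tasks are repeated when `d` asks for its inputs; `b`, whose result survived on `w2`, is executed once. The dependency information in the graph is what lets the run-time limit re-execution to the lost portion.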
Similar resources
Fault-Tolerant Matrix Operations for Networks of Workstations Using Diskless Checkpointing
Networks of workstations (NOWs) offer a cost-effective platform for high-performance, long-running parallel computations. However, these computations must be able to tolerate the changing and often faulty nature of NOW environments. We present high-performance implementations of several fault-tolerant algorithms for distributed scientific computing. The fault-tolerance is based on diskless chec...
Fault Tolerant Matrix Operations for Networks of Workstations Using Multiple Checkpointing
Recently, an algorithm-based approach using diskless checkpointing has been developed to provide fault tolerance for high-performance matrix operations. With this approach, since fault tolerance is incorporated into the matrix operations, the matrix operations become resilient to any single processor failure or change with low overhead. In this paper, we present a technique called multiple chec...
Parallel Processing on Networks of Workstations: A Fault-Tolerant, High Performance Approach
One of the most sought-after software innovations of this decade is the construction of systems using off-the-shelf workstations that actually deliver, and even surpass, the power and reliability of supercomputers. Many researchers are using conventional techniques such as RPC, DSM, replication, causal communications and other techniques to provide parallel computing facilities on workstation ne...
Fault-Tolerant Clusters of Workstations with Single System Image
The computing trend is moving from clustering high-end mainframes to clustering desktop computers. This trend is triggered by the widespread use of PCs, workstations, Gigabit networks, and middleware support for clustering. This paper presents new approaches to achieving fault tolerance and single system image (SSI) in a workstation cluster. A multicomputer cluster is a collection of node compute...
On Synchronisation in Fault-Tolerant Data and Compute Intensive Programs over a Network of Workstations
An application structured as a fault-tolerant bag of tasks adapts easily to changing resources. To be represented by a single bag of tasks, a computation must decompose into purely independent tasks. The work summarised here investigates performance of structuring approaches applicable where this ideal is not possible, partly through analysis and partly through measurements of a realistic fault...